• Final Project
  • Fall 2021, DSPA (HS650)
  • Name: Alxandr Kane York
  • SID: #### - 6909 (last 4 digits only)
  • UMich E-mail:
  • I certify that the following paper represents my own independent work and conforms with the guidelines of academic honesty described in the UMich student handbook.
  • Remember that students are allowed and encouraged to discuss the problems with their classmates on a conceptual level; however, this cannot involve the exchange of actual code, printouts, solutions, e-mails, or other explicit electronic or paper handouts.

1 Abstract

It is important for physicians to be able to make predictions regarding a patient’s risk for chronic diseases, since demographics and certain behaviors can predispose someone to particular disorders. This project aims to predict the presence of angina, stroke, and diabetes from a number of demographic and behavioral measures. Specifically, it uses machine-learning-based approaches both to identify patients with chronic diseases and to narrow down which specific features are predictive of each disease.

2 Introduction

According to information aggregated by the CDC, in a given year 659,000 people die from heart disease, 795,000 suffer a stroke, and more than 87,000 die from diabetes (Virani et al., 2021). Given the high prevalence of these conditions, it is paramount for clinicians to have models that can accurately predict whether a given patient is likely to suffer any of them. The aim of this project is to determine whether risk-related behaviors are indicative of these conditions. It is hypothesized that, with the given data, machine-learning-based classifiers will be able to accurately identify patients with chronic diseases (angina, stroke, and diabetes), and that feature-selection methods (logistic regression, recursive feature elimination, random forest, etc.) will narrow down the list of important predictors for each condition.

3 Methods

This project uses a dataset containing categorical variables related to demographics, health-related behaviors, and history of angina, stroke, and diabetes. To determine which features are most important in predicting the chronic diseases, random forest classification, recursive feature elimination, and logistic regressions are used. Additionally, naive Bayes, linear discriminant analysis, and decision trees are used to predict whether a given patient has each of the three conditions.

4 Loading and Preprocessing the Data

Let’s load the data and preprocess it as needed.

library(plotly)
library(MASS)
library(nnet)
library(caret)
library(corrplot)
library(randomForest)
library(stats)
library(tm)
library(SnowballC)
library(biclust)
library(tidyverse)
library(psych)
library(mi)
library(car)
library(gmodels)
library(e1071)
library(C50)
library(Boruta)
library(Rcpp)

# Loading the data in, changing the data to factor variables, and getting some summary information

health <- read.csv("CaseStudy09_HealthBehaviorRisks_Data_V2.csv")
summary(health)
##        ID             AGE_G           SEX           RACEGR3     
##  Min.   :   1.0   Min.   :1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 250.8   1st Qu.:3.00   1st Qu.:1.000   1st Qu.:1.000  
##  Median : 500.5   Median :5.00   Median :2.000   Median :1.000  
##  Mean   : 500.5   Mean   :4.34   Mean   :1.574   Mean   :1.345  
##  3rd Qu.: 750.2   3rd Qu.:6.00   3rd Qu.:2.000   3rd Qu.:1.000  
##  Max.   :1000.0   Max.   :6.00   Max.   :2.000   Max.   :9.000  
##     IMPEDUC         IMPMRTL         EMPLOY1          INCOMG     
##  Min.   :2.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:1.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :5.000   Median :1.000   Median :7.000   Median :4.000  
##  Mean   :4.889   Mean   :2.216   Mean   :5.541   Mean   :4.337  
##  3rd Qu.:6.000   3rd Qu.:3.000   3rd Qu.:8.000   3rd Qu.:5.000  
##  Max.   :6.000   Max.   :6.000   Max.   :8.000   Max.   :9.000  
##     CVDINFR4        CVDCRHD4        CVDSTRK3        DIABETE3    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :2.000   Median :2.000   Median :2.000   Median :1.000  
##  Mean   :1.996   Mean   :1.813   Mean   :1.905   Mean   :1.487  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :4.000  
##     RFSMOK3         RFDRHV4          FRTLT1          VEGLT1     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :1.000   Median :1.000  
##  Mean   :1.319   Mean   :1.257   Mean   :1.736   Mean   :1.715  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :9.000   Max.   :9.000   Max.   :9.000   Max.   :9.000  
##     TOTINDA     
##  Min.   :1.000  
##  1st Qu.:1.000  
##  Median :1.000  
##  Mean   :1.574  
##  3rd Qu.:2.000  
##  Max.   :9.000
colnames(health) <- c("ID","Age","Sex","Race","School","Marriage","Employment","Income","HeartAttack","Angina","Stroke","Diabetes",
                      "Smoking","Alcohol","Fruits","Vegetables","Leisure")
health <- subset(health, select = -ID)

# All variables are categorical codes, so convert every column to a factor
health[] <- lapply(health, factor)

summary(health$Sex) ## Note: the renamed column is capitalized; health$sex returns NULL
##   1   2 
## 426 574
summary(health$Angina)
##   1   2 
## 187 813
summary(health$HeartAttack) ## Heart attack has too few cases to do an analysis
##   1   2 
##   4 996
summary(health$Stroke)
##   1   2 
##  95 905
summary(health$Diabetes)
##   1   2   3   4 
## 745  48 182  25

Angina, heart attacks, strokes, and diabetes are the main conditions that can be predicted from the risk factors in this dataset. However, only 4 respondents reported having had a heart attack, which is not enough data to make any reasonable predictions. Fortunately, the other conditions were reported by enough respondents to be carried forward for further analysis.
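The class-balance check described above can also be done programmatically. The sketch below uses simulated columns matching the counts reported by summary() (coded 1 = yes, 2 = no); the min_cases threshold is an assumption for illustration, not part of the original analysis.

```r
# Flag outcomes whose rarest class falls below a minimum count; an outcome
# like HeartAttack (4 "yes" responses) would be excluded from modeling.
outcomes <- data.frame(
  HeartAttack = factor(c(rep(1, 4),  rep(2, 996))),  # counts from summary() above
  Stroke      = factor(c(rep(1, 95), rep(2, 905)))
)
min_cases <- 30  # assumed threshold (hypothetical)
usable <- sapply(outcomes, function(x) min(table(x)) >= min_cases)
print(usable)  # HeartAttack FALSE, Stroke TRUE
```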

5 Results

5.1 Prediction Models (Bayes, LDA, Decision Trees)

## Prediction Models (Bayes, LDA, Decision Trees)
healthSubset <- sample(nrow(health), floor(nrow(health)*.60)) # note: call set.seed() first for a reproducible 60/40 split
healthTrain <- health[healthSubset, ]
healthTest <- health[-healthSubset, ]

# Note: passing the full healthTrain frame leaves the outcome columns (including the
# target itself) among the predictors; dropping them (e.g., healthTrain[,-9] for Angina)
# would avoid this leakage, which likely inflates the naive Bayes accuracies below.
anginaBayesModel <- naiveBayes(healthTrain, healthTrain$Angina, type = 'class')
anginaPred <- predict(anginaBayesModel, healthTest)
anginaBayesCT <- CrossTable(anginaPred,healthTest$Angina)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  400 
## 
##  
##              | healthTest$Angina 
##   anginaPred |         1 |         2 | Row Total | 
## -------------|-----------|-----------|-----------|
##            1 |        76 |         1 |        77 | 
##              |   257.435 |    60.386 |           | 
##              |     0.987 |     0.013 |     0.192 | 
##              |     1.000 |     0.003 |           | 
##              |     0.190 |     0.002 |           | 
## -------------|-----------|-----------|-----------|
##            2 |         0 |       323 |       323 | 
##              |    61.370 |    14.395 |           | 
##              |     0.000 |     1.000 |     0.807 | 
##              |     0.000 |     0.997 |           | 
##              |     0.000 |     0.807 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        76 |       324 |       400 | 
##              |     0.190 |     0.810 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
plot_ly(x = c("TN", "FN", "FP", "TP"),
        y = c(anginaBayesCT$prop.row[1,1], anginaBayesCT$prop.row[1,2], anginaBayesCT$prop.row[2,1], anginaBayesCT$prop.row[2,2]),
        name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>% 
  layout(title="Confusion Matrix", 
         legend=list(title=list(text='<b> Metrics </b>')),yaxis=list(title='Probability'))
strokeBayesModel <- naiveBayes(healthTrain, healthTrain$Stroke, type = 'class')
strokePred <- predict(strokeBayesModel, healthTest)
strokeBayesCT <- CrossTable(strokePred,healthTest$Stroke)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  400 
## 
##  
##              | healthTest$Stroke 
##   strokePred |         1 |         2 | Row Total | 
## -------------|-----------|-----------|-----------|
##            1 |        40 |         0 |        40 | 
##              |   314.344 |    35.900 |           | 
##              |     1.000 |     0.000 |     0.100 | 
##              |     0.976 |     0.000 |           | 
##              |     0.100 |     0.000 |           | 
## -------------|-----------|-----------|-----------|
##            2 |         1 |       359 |       360 | 
##              |    34.927 |     3.989 |           | 
##              |     0.003 |     0.997 |     0.900 | 
##              |     0.024 |     1.000 |           | 
##              |     0.002 |     0.897 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        41 |       359 |       400 | 
##              |     0.102 |     0.897 |           | 
## -------------|-----------|-----------|-----------|
## 
## 
plot_ly(x = c("TN", "FN", "FP", "TP"),
        y = c(strokeBayesCT$prop.row[1,1], strokeBayesCT$prop.row[1,2], strokeBayesCT$prop.row[2,1], strokeBayesCT$prop.row[2,2]),
        name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>% 
  layout(title="Confusion Matrix", 
         legend=list(title=list(text='<b> Metrics </b>')),yaxis=list(title='Probability'))
diabetesBayesModel <- naiveBayes(healthTrain, healthTrain$Diabetes, type = 'class')
diabetesPred <- predict(diabetesBayesModel, healthTest)
CrossTable(diabetesPred,healthTest$Diabetes)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  400 
## 
##  
##              | healthTest$Diabetes 
## diabetesPred |         1 |         2 |         3 |         4 | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##            1 |       300 |         1 |         0 |         1 |       302 | 
##              |    23.851 |    16.928 |    47.565 |     7.170 |           | 
##              |     0.993 |     0.003 |     0.000 |     0.003 |     0.755 | 
##              |     1.000 |     0.040 |     0.000 |     0.083 |           | 
##              |     0.750 |     0.002 |     0.000 |     0.002 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##            2 |         0 |        23 |         0 |         0 |        23 | 
##              |    17.250 |   323.438 |     3.623 |     0.690 |           | 
##              |     0.000 |     1.000 |     0.000 |     0.000 |     0.058 | 
##              |     0.000 |     0.920 |     0.000 |     0.000 |           | 
##              |     0.000 |     0.058 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##            3 |         0 |         1 |        63 |         0 |        64 | 
##              |    48.000 |     2.250 |   277.830 |     1.920 |           | 
##              |     0.000 |     0.016 |     0.984 |     0.000 |     0.160 | 
##              |     0.000 |     0.040 |     1.000 |     0.000 |           | 
##              |     0.000 |     0.002 |     0.158 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##            4 |         0 |         0 |         0 |        11 |        11 | 
##              |     8.250 |     0.688 |     1.732 |   344.997 |           | 
##              |     0.000 |     0.000 |     0.000 |     1.000 |     0.028 | 
##              |     0.000 |     0.000 |     0.000 |     0.917 |           | 
##              |     0.000 |     0.000 |     0.000 |     0.028 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
## Column Total |       300 |        25 |        63 |        12 |       400 | 
##              |     0.750 |     0.062 |     0.158 |     0.030 |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
## 
## 
ldaAngina <- lda(data=healthTrain, Angina~.)
ldaAnginaPred <- predict(ldaAngina,healthTest)
anginaCT <- CrossTable(ldaAnginaPred$class,healthTest$Angina)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  400 
## 
##  
##                     | healthTest$Angina 
## ldaAnginaPred$class |         1 |         2 | Row Total | 
## --------------------|-----------|-----------|-----------|
##                   1 |        59 |        15 |        74 | 
##                     |   143.642 |    33.694 |           | 
##                     |     0.797 |     0.203 |     0.185 | 
##                     |     0.776 |     0.046 |           | 
##                     |     0.147 |     0.037 |           | 
## --------------------|-----------|-----------|-----------|
##                   2 |        17 |       309 |       326 | 
##                     |    32.606 |     7.648 |           | 
##                     |     0.052 |     0.948 |     0.815 | 
##                     |     0.224 |     0.954 |           | 
##                     |     0.042 |     0.772 |           | 
## --------------------|-----------|-----------|-----------|
##        Column Total |        76 |       324 |       400 | 
##                     |     0.190 |     0.810 |           | 
## --------------------|-----------|-----------|-----------|
## 
## 
plot_ly(x = c("TN", "FN", "FP", "TP"),
        y = c(anginaCT$prop.row[1,1], anginaCT$prop.row[1,2], anginaCT$prop.row[2,1], anginaCT$prop.row[2,2]),
        name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>% 
  layout(title="Confusion Matrix", 
         legend=list(title=list(text='<b> Metrics </b>')),yaxis=list(title='Probability'))
ldaStroke<- lda(data=healthTrain, Stroke~.)
ldaStrokePred <- predict(ldaStroke,healthTest)
strokeCT <- CrossTable(ldaStrokePred$class,healthTest$Stroke)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  400 
## 
##  
##                     | healthTest$Stroke 
## ldaStrokePred$class |         1 |         2 | Row Total | 
## --------------------|-----------|-----------|-----------|
##                   1 |        34 |         6 |        40 | 
##                     |   218.051 |    24.903 |           | 
##                     |     0.850 |     0.150 |     0.100 | 
##                     |     0.829 |     0.017 |           | 
##                     |     0.085 |     0.015 |           | 
## --------------------|-----------|-----------|-----------|
##                   2 |         7 |       353 |       360 | 
##                     |    24.228 |     2.767 |           | 
##                     |     0.019 |     0.981 |     0.900 | 
##                     |     0.171 |     0.983 |           | 
##                     |     0.018 |     0.882 |           | 
## --------------------|-----------|-----------|-----------|
##        Column Total |        41 |       359 |       400 | 
##                     |     0.102 |     0.897 |           | 
## --------------------|-----------|-----------|-----------|
## 
## 
plot_ly(x = c("TN", "FN", "FP", "TP"),
        y = c(strokeCT$prop.row[1,1], strokeCT$prop.row[1,2], strokeCT$prop.row[2,1], strokeCT$prop.row[2,2]),
        name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>% 
  layout(title="Confusion Matrix", 
         legend=list(title=list(text='<b> Metrics </b>')),yaxis=list(title='Probability'))
ldaDiabetes<- lda(data=healthTrain, Diabetes~.)
ldaDiabetesPred <- predict(ldaDiabetes,healthTest)
diabetesCT <- CrossTable(ldaDiabetesPred$class,healthTest$Diabetes)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  400 
## 
##  
##                       | healthTest$Diabetes 
## ldaDiabetesPred$class |         1 |         2 |         3 |         4 | Row Total | 
## ----------------------|-----------|-----------|-----------|-----------|-----------|
##                     1 |       272 |        22 |        26 |         6 |       326 | 
##                       |     3.093 |     0.130 |    12.511 |     1.461 |           | 
##                       |     0.834 |     0.067 |     0.080 |     0.018 |     0.815 | 
##                       |     0.907 |     0.880 |     0.413 |     0.500 |           | 
##                       |     0.680 |     0.055 |     0.065 |     0.015 |           | 
## ----------------------|-----------|-----------|-----------|-----------|-----------|
##                     2 |         0 |         1 |         0 |         0 |         1 | 
##                       |     0.750 |    14.062 |     0.158 |     0.030 |           | 
##                       |     0.000 |     1.000 |     0.000 |     0.000 |     0.002 | 
##                       |     0.000 |     0.040 |     0.000 |     0.000 |           | 
##                       |     0.000 |     0.002 |     0.000 |     0.000 |           | 
## ----------------------|-----------|-----------|-----------|-----------|-----------|
##                     3 |        26 |         2 |        37 |         6 |        71 | 
##                       |    13.945 |     1.339 |    59.606 |     7.031 |           | 
##                       |     0.366 |     0.028 |     0.521 |     0.085 |     0.177 | 
##                       |     0.087 |     0.080 |     0.587 |     0.500 |           | 
##                       |     0.065 |     0.005 |     0.092 |     0.015 |           | 
## ----------------------|-----------|-----------|-----------|-----------|-----------|
##                     4 |         2 |         0 |         0 |         0 |         2 | 
##                       |     0.167 |     0.125 |     0.315 |     0.060 |           | 
##                       |     1.000 |     0.000 |     0.000 |     0.000 |     0.005 | 
##                       |     0.007 |     0.000 |     0.000 |     0.000 |           | 
##                       |     0.005 |     0.000 |     0.000 |     0.000 |           | 
## ----------------------|-----------|-----------|-----------|-----------|-----------|
##          Column Total |       300 |        25 |        63 |        12 |       400 | 
##                       |     0.750 |     0.062 |     0.158 |     0.030 |           | 
## ----------------------|-----------|-----------|-----------|-----------|-----------|
## 
## 
DecisionTreeAngina <- C5.0(healthTrain[,-9], healthTrain$Angina)
DTAPred <- predict(DecisionTreeAngina,healthTest[,-9])
confusionMatrix(table(DTAPred,healthTest$Angina))
## Confusion Matrix and Statistics
## 
##        
## DTAPred   1   2
##       1  60  18
##       2  16 306
##                                           
##                Accuracy : 0.915           
##                  95% CI : (0.8832, 0.9404)
##     No Information Rate : 0.81            
##     P-Value [Acc > NIR] : 3.706e-09       
##                                           
##                   Kappa : 0.7266          
##                                           
##  Mcnemar's Test P-Value : 0.8638          
##                                           
##             Sensitivity : 0.7895          
##             Specificity : 0.9444          
##          Pos Pred Value : 0.7692          
##          Neg Pred Value : 0.9503          
##              Prevalence : 0.1900          
##          Detection Rate : 0.1500          
##    Detection Prevalence : 0.1950          
##       Balanced Accuracy : 0.8670          
##                                           
##        'Positive' Class : 1               
## 
DecisionTreeStroke <- C5.0(healthTrain[,-10], healthTrain$Stroke)
DTSPred <- predict(DecisionTreeStroke,healthTest[,-10])
confusionMatrix(table(DTSPred,healthTest$Stroke))
## Confusion Matrix and Statistics
## 
##        
## DTSPred   1   2
##       1  34   8
##       2   7 351
##                                           
##                Accuracy : 0.9625          
##                  95% CI : (0.9389, 0.9789)
##     No Information Rate : 0.8975          
##     P-Value [Acc > NIR] : 1.126e-06       
##                                           
##                   Kappa : 0.7984          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8293          
##             Specificity : 0.9777          
##          Pos Pred Value : 0.8095          
##          Neg Pred Value : 0.9804          
##              Prevalence : 0.1025          
##          Detection Rate : 0.0850          
##    Detection Prevalence : 0.1050          
##       Balanced Accuracy : 0.9035          
##                                           
##        'Positive' Class : 1               
## 
DecisionTreeDiabetes <- C5.0(healthTrain[,-11], healthTrain$Diabetes)
DTDPred <- predict(DecisionTreeDiabetes, healthTest[,-11])
confusionMatrix(table(DTDPred,healthTest$Diabetes))
## Confusion Matrix and Statistics
## 
##        
## DTDPred   1   2   3   4
##       1 271  20  32   5
##       2   0   0   0   0
##       3  29   5  31   7
##       4   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.755           
##                  95% CI : (0.7098, 0.7964)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.4349          
##                                           
##                   Kappa : 0.3131          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.9033   0.0000   0.4921     0.00
## Specificity            0.4300   1.0000   0.8783     1.00
## Pos Pred Value         0.8262      NaN   0.4306      NaN
## Neg Pred Value         0.5972   0.9375   0.9024     0.97
## Prevalence             0.7500   0.0625   0.1575     0.03
## Detection Rate         0.6775   0.0000   0.0775     0.00
## Detection Prevalence   0.8200   0.0000   0.1800     0.00
## Balanced Accuracy      0.6667   0.5000   0.6852     0.50

Naive Bayes and LDA achieve high true-positive and true-negative rates for predicting angina and stroke, and the decision trees reach high accuracies as well (91.5% for angina, 96.3% for stroke). The near-perfect naive Bayes tables should be read with caution, since the outcome columns were left among the predictors. Diabetes is harder to predict: the decision tree’s 75.5% accuracy is barely above the 75% no-information rate, and it never predicts classes 2 or 4.
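As a sanity check, the angina decision-tree metrics can be recomputed by hand from the printed confusion matrix (class 1, “has angina,” taken as positive):

```r
# Rebuild the 2x2 table reported by confusionMatrix() for the angina tree
cm <- matrix(c(60, 16, 18, 306), nrow = 2,
             dimnames = list(pred = c("1", "2"), actual = c("1", "2")))
accuracy    <- sum(diag(cm)) / sum(cm)        # (60 + 306) / 400 = 0.915
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # 60 / 76  ~ 0.7895
specificity <- cm["2", "2"] / sum(cm[, "2"])  # 306 / 324 ~ 0.9444
```

These match the Accuracy, Sensitivity, and Specificity values printed above.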

5.2 Feature Selection

## Logistic Regressions
anginaLM <- glm(Angina ~ ., data = health, family = "binomial")
summary(anginaLM)
## 
## Call:
## glm(formula = Angina ~ ., family = "binomial", data = health)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4759   0.0000   0.0000   0.0462   3.4413  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.679e+01  1.318e+04   0.001 0.998983    
## Age2          6.230e+00  1.263e+00   4.931 8.16e-07 ***
## Age3          9.018e+00  1.481e+00   6.087 1.15e-09 ***
## Age4          1.199e+01  1.666e+00   7.195 6.25e-13 ***
## Age5          1.622e+01  2.141e+00   7.573 3.63e-14 ***
## Age6          4.158e+01  2.363e+03   0.018 0.985963    
## Sex2         -1.829e+00  5.013e-01  -3.649 0.000263 ***
## Race2        -9.829e-01  9.167e-01  -1.072 0.283609    
## Race3        -8.209e-01  1.192e+00  -0.689 0.491064    
## Race4         6.718e+00  6.856e+00   0.980 0.327136    
## Race5        -2.352e+00  2.023e+00  -1.162 0.245042    
## Race9        -6.370e-02  1.275e+00  -0.050 0.960150    
## School3      -1.737e+01  1.318e+04  -0.001 0.998948    
## School4      -2.082e+01  1.318e+04  -0.002 0.998739    
## School5      -2.321e+01  1.318e+04  -0.002 0.998595    
## School6      -2.457e+01  1.318e+04  -0.002 0.998512    
## Marriage2    -2.785e-01  5.513e-01  -0.505 0.613454    
## Marriage3     3.870e-01  6.845e-01   0.565 0.571852    
## Marriage4    -1.546e+01  1.601e+03  -0.010 0.992298    
## Marriage5     3.313e-01  6.096e-01   0.543 0.586829    
## Marriage6     2.449e-02  9.976e-01   0.025 0.980418    
## Employment2   2.813e+00  2.500e+00   1.125 0.260580    
## Employment3  -1.959e+00  1.462e+00  -1.340 0.180261    
## Employment4  -3.354e+00  1.176e+00  -2.851 0.004355 ** 
## Employment5  -7.093e-01  1.115e+00  -0.636 0.524676    
## Employment6  -1.641e+00  1.697e+00  -0.967 0.333564    
## Employment7  -9.317e-01  6.465e-01  -1.441 0.149553    
## Employment8  -1.028e+00  5.656e-01  -1.817 0.069238 .  
## Income2      -2.436e-01  8.518e-01  -0.286 0.774941    
## Income3       1.170e-04  9.685e-01   0.000 0.999904    
## Income4      -6.148e-01  7.638e-01  -0.805 0.420811    
## Income5       4.917e-01  6.980e-01   0.704 0.481196    
## Income9       1.396e+00  8.940e-01   1.561 0.118455    
## HeartAttack2 -2.684e+00  1.753e+00  -1.531 0.125794    
## Stroke2       6.602e-01  9.955e-01   0.663 0.507247    
## Diabetes2    -1.168e+00  7.027e-01  -1.661 0.096628 .  
## Diabetes3     1.436e+01  2.447e+03   0.006 0.995319    
## Diabetes4     1.155e+01  6.140e+03   0.002 0.998499    
## Smoking2      4.874e+00  9.731e-01   5.009 5.48e-07 ***
## Smoking9      2.076e+00  1.333e+00   1.558 0.119241    
## Alcohol2      4.398e+00  1.745e+00   2.520 0.011745 *  
## Alcohol9      1.495e+00  1.291e+00   1.158 0.246773    
## Fruits2       1.612e+00  4.979e-01   3.238 0.001205 ** 
## Fruits9      -1.226e+00  1.031e+00  -1.189 0.234391    
## Vegetables2   2.313e+00  6.204e-01   3.728 0.000193 ***
## Vegetables9  -4.775e-01  7.541e-01  -0.633 0.526553    
## Leisure2      2.131e+00  6.307e-01   3.379 0.000728 ***
## Leisure9     -2.230e+00  1.373e+00  -1.624 0.104401    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 963.69  on 999  degrees of freedom
## Residual deviance: 188.18  on 952  degrees of freedom
## AIC: 284.18
## 
## Number of Fisher Scoring iterations: 21
x <- summary(anginaLM)$coefficients[,4] < .05
anginaLMSig <- names(which(x)) ## Saving the variables that are significant, for later comparison across feature-selection methods

strokeLM <- glm(Stroke ~ ., data = health, family = "binomial")
summary(strokeLM)
## 
## Call:
## glm(formula = Stroke ~ ., family = "binomial", data = health)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.701   0.000   0.000   0.000   2.397  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   6.575e+00  2.341e+04   0.000 0.999776    
## Age2          6.692e+00  1.365e+00   4.903 9.42e-07 ***
## Age3          1.482e+01  3.009e+00   4.927 8.37e-07 ***
## Age4          3.289e+01  4.074e+03   0.008 0.993559    
## Age5          3.551e+01  3.309e+03   0.011 0.991439    
## Age6          3.436e+01  2.757e+03   0.012 0.990057    
## Sex2          3.718e-01  7.262e-01   0.512 0.608654    
## Race2        -9.632e-02  2.671e+00  -0.036 0.971234    
## Race3         6.705e+00  2.715e+00   2.470 0.013516 *  
## Race4         2.212e-01  5.137e+01   0.004 0.996564    
## Race5         1.586e+00  3.227e+00   0.491 0.623205    
## Race9        -2.058e+00  1.565e+00  -1.315 0.188573    
## School3      -6.445e+00  2.341e+04   0.000 0.999780    
## School4      -1.063e+01  2.341e+04   0.000 0.999638    
## School5      -1.240e+01  2.341e+04  -0.001 0.999577    
## School6      -1.665e+01  2.341e+04  -0.001 0.999433    
## Marriage2     4.557e-02  1.047e+00   0.044 0.965298    
## Marriage3    -5.448e-01  9.836e-01  -0.554 0.579645    
## Marriage4     3.147e+00  8.088e+00   0.389 0.697185    
## Marriage5    -2.088e+00  1.072e+00  -1.948 0.051375 .  
## Marriage6     3.642e+00  2.224e+00   1.638 0.101523    
## Employment2   2.135e-01  5.747e+00   0.037 0.970358    
## Employment3   6.420e+00  2.168e+00   2.961 0.003069 ** 
## Employment4  -6.157e+00  2.513e+00  -2.450 0.014279 *  
## Employment5   6.330e+00  1.988e+00   3.184 0.001451 ** 
## Employment6   1.814e+01  1.565e+04   0.001 0.999075    
## Employment7   2.207e+00  1.160e+00   1.903 0.056992 .  
## Employment8   2.405e+00  9.568e-01   2.514 0.011938 *  
## Income2      -4.978e+00  1.847e+00  -2.695 0.007041 ** 
## Income3      -3.992e+00  1.894e+00  -2.107 0.035098 *  
## Income4      -5.276e+00  1.860e+00  -2.836 0.004561 ** 
## Income5      -4.874e+00  1.648e+00  -2.958 0.003093 ** 
## Income9      -2.482e+00  1.597e+00  -1.554 0.120079    
## HeartAttack2  1.979e+00  1.932e+00   1.025 0.305522    
## Angina2       4.649e-01  1.282e+00   0.362 0.716981    
## Diabetes2    -1.031e+00  1.314e+00  -0.785 0.432664    
## Diabetes3     1.084e+01  3.299e+03   0.003 0.997379    
## Diabetes4    -1.419e+00  1.168e+04   0.000 0.999903    
## Smoking2      5.190e+00  1.500e+00   3.460 0.000541 ***
## Smoking9     -3.707e+00  1.965e+00  -1.887 0.059168 .  
## Alcohol2     -1.583e+00  1.804e+00  -0.877 0.380374    
## Alcohol9      2.200e-01  6.036e+00   0.036 0.970929    
## Fruits2       1.399e+00  7.798e-01   1.794 0.072844 .  
## Fruits9      -4.617e-01  1.680e+00  -0.275 0.783448    
## Vegetables2   3.971e+00  1.116e+00   3.557 0.000375 ***
## Vegetables9  -1.493e+00  1.263e+00  -1.182 0.237295    
## Leisure2      3.652e+00  1.050e+00   3.478 0.000504 ***
## Leisure9     -1.794e+00  1.808e+00  -0.992 0.321001    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 627.912  on 999  degrees of freedom
## Residual deviance:  88.892  on 952  degrees of freedom
## AIC: 184.89
## 
## Number of Fisher Scoring iterations: 22
x <- summary(strokeLM)$coefficients[,4] < .05
strokeLMSig <- names(which(x)) ## Saving the variables that are significant, for later comparison across feature-selection methods
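The saved name vectors can then be compared directly, e.g. to see which predictors are flagged for both angina and stroke. The sketch below is a self-contained, hypothetical illustration of that comparison: anginaSig and strokeSig are stand-ins populated with coefficient names that are significant in the two regression summaries above.

```r
# Hypothetical stand-ins mirroring anginaLMSig / strokeLMSig (names taken
# from the significant rows of the two glm() summaries above)
anginaSig <- c("Age2", "Age3", "Sex2", "Smoking2", "Vegetables2", "Leisure2")
strokeSig <- c("Age2", "Age3", "Income2", "Smoking2", "Vegetables2", "Leisure2")
shared <- intersect(anginaSig, strokeSig)  # predictors significant for both conditions
print(shared)
```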

# Multinomial Logistic Regression
diabetesLM <- multinom(Diabetes ~ ., data = health)
## # weights:  188 (138 variable)
## initial  value 1386.294361 
## iter  10 value 485.423916
## iter  20 value 400.229607
## iter  30 value 391.035328
## iter  40 value 389.248554
## iter  50 value 388.918363
## iter  60 value 388.755521
## iter  70 value 388.588214
## iter  80 value 388.554762
## iter  90 value 388.456470
## iter 100 value 388.446908
## final  value 388.446908 
## stopped after 100 iterations
summary(diabetesLM)
## Call:
## multinom(formula = Diabetes ~ ., data = health)
## 
## Coefficients:
##   (Intercept)      Age2      Age3      Age4       Age5       Age6       Sex2
## 2  -23.845640  0.228492 0.1890488  1.102002 -0.1193996  0.1000789 12.2415970
## 3  -10.123903  7.526140 8.3406602 10.293995 11.9331364 13.3489896 -0.8916207
## 4   -6.205156 -4.471158 8.2636577 10.222035 10.7712852 12.8469054 -0.7550098
##         Race2      Race3       Race4        Race5    Race9    School3
## 2   1.3270821 -10.911943 -10.1172718 -10.23349935 2.448592 -13.897770
## 3   0.3640057  -2.268112   0.1320356  -0.01904884 2.338475  -4.437903
## 4 -12.3052437 -17.965338  -7.2198419  -9.19831731 4.208112  -8.111618
##      School4     School5     School6  Marriage2    Marriage3    Marriage4
## 2  0.5390254   0.6464011   0.8929987 -0.6486809 -1.623709531   0.18674280
## 3 -6.2207739  -8.9783005 -11.6298273 -0.1458774 -0.001513464  -0.07101547
## 4 -8.6991456 -10.5154197 -14.3019830 -0.8386655 -0.452226177 -13.82239802
##     Marriage5    Marriage6   Employment2 Employment3 Employment4 Employment5
## 2 -0.54514462  -1.33242728   0.010669566   0.3828141  -10.066596   0.3708215
## 3 -0.94079389  -0.07848923  -0.003073865  -1.1689414    1.254144   0.6417032
## 4  0.01474081 -10.15974267 -10.079925026   1.6887563   -9.293229   0.4854404
##   Employment6 Employment7 Employment8     Income2    Income3    Income4
## 2   1.7835135  -0.3997252   0.6013019 -0.40677611 -2.4787821  0.1220308
## 3   0.5621604   0.1089195  -0.4876101 -0.04393232  0.5190654 -0.4356655
## 4  -9.4477142   0.5840835  -0.4333128  0.76629283  1.2107869  0.6719532
##      Income5     Income9 HeartAttack2    Angina2    Stroke2  Smoking2
## 2 -0.4871224 -0.05776236    7.0416523 -0.7617192  1.8094802 0.2649059
## 3 -0.3083258  0.34726289   -0.7082079  3.5307840  0.2570182 3.2793888
## 4  1.4394547  1.63426503   -3.0432397  4.7325083 -2.4347133 3.5799597
##      Smoking9   Alcohol2   Alcohol9   Fruits2   Fruits9 Vegetables2 Vegetables9
## 2   0.3495308 -0.7915249 -1.3527356 0.1618617 1.0243905   0.7599804   0.8458477
## 3   0.2566212  2.4709877  0.9318741 1.4147436 0.4395236   1.1527861   0.7630708
## 4 -10.2700939  2.3654527 -9.7233765 1.2501587 0.7499800   0.9894764 -10.5734441
##    Leisure2   Leisure9
## 2 0.7770696 -0.8974989
## 3 2.4250543 -0.5097800
## 4 1.4348926 -0.2750364
## 
## Std. Errors:
##   (Intercept)      Age2       Age3       Age4       Age5       Age6       Sex2
## 2   111.12640  1.107685   1.169325   1.208408   1.273265   1.268669 64.6700930
## 3    96.20731 92.981636  92.980287  92.979495  92.979599  92.979702  0.2752929
## 4   127.19162 74.372250 123.056849 123.063549 123.064328 123.064456  0.5258059
##         Race2        Race3      Race4      Race5    Race9   School3   School4
## 2   0.5264059 1.642437e+02 180.365712 173.269557 1.034636 0.2920375 47.893981
## 3   0.5068182 1.222413e+00   1.049532   1.231685 1.105757 2.6216619  2.591366
## 4 176.6119494 9.528947e-05 145.206562 206.141227 1.265192 2.9627181  2.815176
##     School5   School6 Marriage2 Marriage3    Marriage4 Marriage5   Marriage6
## 2 47.894122 47.893165 0.5516379 0.7974547 1.3808488802 0.5229251   1.1805731
## 3  2.634921  2.696002 0.4230981 0.3987608 1.5308131868 0.4095321   0.7007587
## 4  2.870052  3.194266 0.9117707 0.8174344 0.0002699382 0.6715403 144.7473168
##   Employment2 Employment3 Employment4 Employment5 Employment6 Employment7
## 2   0.9569608    1.197182  193.653144   0.9093629   1.0911126   0.5920207
## 3   0.6754031    1.328496    1.041996   0.7145045   0.8090641   0.3882478
## 4 157.8802172    1.577241  233.680212   1.2734154 205.8712207   0.6650961
##   Employment8   Income2   Income3   Income4   Income5   Income9 HeartAttack2
## 2   0.4538096 0.6316236 1.1421688 0.6147420 0.5590656 0.6638731     76.63196
## 3   0.3373228 0.5470485 0.5594095 0.5348256 0.4963819 0.5807680     18.70123
## 4   0.6326563 1.3110992 1.4828685 1.3040046 1.2511020 1.3597623     17.49951
##      Angina2   Stroke2  Smoking2    Smoking9  Alcohol2    Alcohol9   Fruits2
## 2  0.5709167  1.086667 0.5347810   0.9680688 1.1686414   1.3473229 0.4018891
## 3 15.5643487 13.845363 0.3998010   0.9646119 0.4967988   0.8098973 0.2990350
## 4 63.8326779 36.321014 0.6515794 176.5866845 0.9322985 169.0603771 0.5310582
##     Fruits9 Vegetables2 Vegetables9  Leisure2  Leisure9
## 2 0.5891724   0.4246063   0.6591481 0.4300163 1.1801469
## 3 0.6486549   0.3050981   0.5492398 0.3361840 0.6796632
## 4 1.2982978   0.5659303 172.8184802 0.6016299 1.2437793
## 
## Residual Deviance: 776.8938 
## AIC: 1052.894
z <- summary(diabetesLM)$coefficients/summary(diabetesLM)$standard.errors #Getting z-scores
p <- (1 - pnorm(abs(z),0,1))*2 ## Getting the p-values
p
##   (Intercept)      Age2      Age3      Age4      Age5      Age6        Sex2
## 2   0.8300938 0.8365730 0.8715631 0.3617975 0.9252884 0.9371241 0.849863168
## 3   0.9161932 0.9354879 0.9285228 0.9118443 0.8978788 0.8858409 0.001200309
## 4   0.9610899 0.9520612 0.9464598 0.9338014 0.9302537 0.9168585 0.151028005
##        Race2      Race3     Race4     Race5        Race9     School3
## 2 0.01170138 0.94702951 0.9552676 0.9529034 0.0179512756 0.000000000
## 3 0.47262325 0.06353354 0.8998868 0.9876607 0.0344455047 0.090496934
## 4 0.94445321 0.00000000 0.9603446 0.9644091 0.0008808087 0.006183338
##       School4      School5      School6 Marriage2  Marriage3 Marriage4
## 2 0.991020355 0.9892317103 9.851238e-01 0.2396277 0.04173881 0.8924240
## 3 0.016369252 0.0006557862 1.605218e-05 0.7302569 0.99697170 0.9629989
## 4 0.002000961 0.0002484598 7.556029e-06 0.3576665 0.58010847 0.0000000
##    Marriage5 Marriage6 Employment2 Employment3 Employment4 Employment5
## 2 0.29718413 0.2590550   0.9911042   0.7491483   0.9585426   0.6834340
## 3 0.02160503 0.9108186   0.9963687   0.3789144   0.2287452   0.3691272
## 4 0.98248725 0.9440428   0.9490933   0.2843032   0.9682773   0.7030464
##   Employment6 Employment7 Employment8   Income2    Income3   Income4   Income5
## 2   0.1021366   0.4995564   0.1851681 0.5195646 0.02998857 0.8426481 0.3835821
## 3   0.4871631   0.7790623   0.1483092 0.9359924 0.35346933 0.4153050 0.5345033
## 4   0.9633968   0.3798384   0.4934006 0.5589069 0.41420468 0.6063436 0.2499176
##     Income9 HeartAttack2   Angina2    Stroke2     Smoking2  Smoking9
## 2 0.9306650    0.9267860 0.1821371 0.09587978 6.203503e-01 0.7180547
## 3 0.5498817    0.9697917 0.8205398 0.98518933 2.220446e-16 0.7902117
## 4 0.2294117    0.8619407 0.9408996 0.94655528 3.923081e-08 0.9536220
##       Alcohol2  Alcohol9      Fruits2    Fruits9  Vegetables2 Vegetables9
## 2 4.982134e-01 0.3153702 6.871306e-01 0.08208862 0.0734784985   0.1994067
## 3 6.564649e-07 0.2498937 2.233832e-06 0.49803016 0.0001578317   0.1647350
## 4 1.117346e-02 0.9541356 1.856812e-02 0.56349094 0.0803935932   0.9512140
##       Leisure2  Leisure9
## 2 7.075133e-02 0.4469572
## 3 5.453415e-13 0.4532258
## 4 1.707871e-02 0.8249915
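Because the multinomial fit estimates one coefficient per outcome level, a term can be treated as significant if its Wald p-value falls below .05 for any level. A minimal base-R sketch on a toy stand-in for the `p` matrix above (the real matrix has outcome levels 2-4 as rows and all model coefficients as columns):

```r
# Toy stand-in for the matrix of Wald p-values computed above: rows are
# outcome levels 2-4, columns are three of the model coefficients.
p_demo <- matrix(c(0.83, 0.92, 0.96,        # (Intercept)
                   0.62, 2.2e-16, 3.9e-08,  # Smoking2
                   0.50, 6.6e-07, 1.1e-02), # Alcohol2
                 nrow = 3,
                 dimnames = list(c("2", "3", "4"),
                                 c("(Intercept)", "Smoking2", "Alcohol2")))
sig_any <- apply(p_demo < .05, 2, any)  # significant for at least one level?
names(sig_any)[sig_any]
## [1] "Smoking2" "Alcohol2"
```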
## Feature Selection
BorutaAngina <- Boruta(Angina ~., data = health, doTrace=0)
BorutaAngina
## Boruta performed 99 iterations in 5.156267 secs.
##  8 attributes confirmed important: Age, Alcohol, Diabetes, Leisure,
## School and 3 more;
##  6 attributes confirmed unimportant: Employment, HeartAttack, Income,
## Marriage, Race and 1 more;
##  1 tentative attributes left: Fruits;
df_long <- tidyr::gather(as.data.frame(BorutaAngina$ImpHistory), feature, measurement)

plot_ly(df_long, y = ~measurement, color = ~feature, type = "box") %>%
  layout(title="Box-and-whisker Plots across all Features",
         xaxis = list(title="Features"),
         yaxis = list(title="Importance"),
         showlegend=F)
BorutaAnginaVars <- getSelectedAttributes(BorutaAngina)


control<-rfeControl(functions = rfFuncs, method = "cv", number=10)
rf.trainAngina <- rfe(healthTrain[,-9], healthTrain[,9],sizes=c(10, 15), rfeControl=control)
rf.trainAngina
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##         10   0.9300 0.7467    0.02212 0.08561        *
##         15   0.9299 0.7413    0.01572 0.06246         
## 
## The top 5 variables (out of 10):
##    Age, Stroke, Smoking, Sex, School
rfAnginaVars <- predictors(rf.trainAngina)

BorutaAnginaVars
## [1] "Age"      "Sex"      "School"   "Stroke"   "Diabetes" "Smoking"  "Alcohol" 
## [8] "Leisure"
rfAnginaVars
##  [1] "Age"        "Stroke"     "Smoking"    "Sex"        "School"    
##  [6] "Diabetes"   "Leisure"    "Alcohol"    "Income"     "Vegetables"
anginaLMSig
##  [1] "Age2"        "Age3"        "Age4"        "Age5"        "Sex2"       
##  [6] "Employment4" "Smoking2"    "Alcohol2"    "Fruits2"     "Vegetables2"
## [11] "Leisure2"
anginaOverlap <- c("Age","Sex","Smoking","Alcohol","Leisure") # For predicting angina, these are the features identified in common by all three feature selection methods (note Sex appears in all three lists above).
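Transcribing the consensus set by eye risks omissions, so as a sketch the overlap can also be derived programmatically from the three lists printed above, after stripping the factor-level suffixes from the regression terms:

```r
# Sketch: derive the consensus feature set from the three printed lists.
lm_terms <- c("Age2", "Age3", "Age4", "Age5", "Sex2", "Employment4",
              "Smoking2", "Alcohol2", "Fruits2", "Vegetables2", "Leisure2")
lm_vars  <- unique(gsub("[0-9]+$", "", lm_terms))  # drop factor-level suffixes
boruta_vars <- c("Age", "Sex", "School", "Stroke", "Diabetes",
                 "Smoking", "Alcohol", "Leisure")
rfe_vars    <- c("Age", "Stroke", "Smoking", "Sex", "School",
                 "Diabetes", "Leisure", "Alcohol", "Income", "Vegetables")
angina_consensus <- Reduce(intersect, list(lm_vars, boruta_vars, rfe_vars))
angina_consensus
## [1] "Age"     "Sex"     "Smoking" "Alcohol" "Leisure"
```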


BorutaStroke <- Boruta(Stroke ~., data = health, doTrace=0)
BorutaStroke
## Boruta performed 99 iterations in 3.536886 secs.
##  6 attributes confirmed important: Age, Angina, Diabetes, Income,
## School and 1 more;
##  7 attributes confirmed unimportant: Alcohol, Employment, Fruits,
## HeartAttack, Marriage and 2 more;
##  2 tentative attributes left: Leisure, Smoking;
df_long <- tidyr::gather(as.data.frame(BorutaStroke$ImpHistory), feature, measurement)

plot_ly(df_long, y = ~measurement, color = ~feature, type = "box") %>%
  layout(title="Box-and-whisker Plots across all Features",
         xaxis = list(title="Features"),
         yaxis = list(title="Importance"),
         showlegend=F)
BorutaStrokeVars <- getSelectedAttributes(BorutaStroke)

control<-rfeControl(functions = rfFuncs, method = "cv", number=10)
rf.trainStroke <- rfe(healthTrain[,-10], healthTrain[,10],sizes=c(10, 15), rfeControl=control)
rf.trainStroke
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##         10   0.9552 0.7179    0.02576  0.1691        *
##         15   0.9535 0.7013    0.02159  0.1433         
## 
## The top 5 variables (out of 10):
##    Age, Angina, Sex, Diabetes, School
rfStrokeVars <- predictors(rf.trainStroke)

BorutaStrokeVars
## [1] "Age"      "Sex"      "School"   "Income"   "Angina"   "Diabetes"
rfStrokeVars
##  [1] "Age"         "Angina"      "Sex"         "Diabetes"    "School"     
##  [6] "Leisure"     "Income"      "Smoking"     "Fruits"      "HeartAttack"
strokeLMSig
##  [1] "Age2"        "Age3"        "Race3"       "Employment3" "Employment4"
##  [6] "Employment5" "Employment8" "Income2"     "Income3"     "Income4"    
## [11] "Income5"     "Smoking2"    "Vegetables2" "Leisure2"
strokeOverlap <- c("Age","Income") # For predicting stroke, these are the features identified in common by all three feature selection methods.

BorutaDiabetes <- Boruta(Diabetes ~., data = health, doTrace=0)
BorutaDiabetes
## Boruta performed 99 iterations in 7.435833 secs.
##  8 attributes confirmed important: Age, Alcohol, Angina, Fruits,
## Leisure and 3 more;
##  5 attributes confirmed unimportant: Employment, HeartAttack, Income,
## Marriage, Race;
##  2 tentative attributes left: Sex, Vegetables;
df_long <- tidyr::gather(as.data.frame(BorutaDiabetes$ImpHistory), feature, measurement)

plot_ly(df_long, y = ~measurement, color = ~feature, type = "box") %>%
  layout(title="Box-and-whisker Plots across all Features",
         xaxis = list(title="Features"),
         yaxis = list(title="Importance"),
         showlegend=F)
BorutaDiabetesVars <- getSelectedAttributes(BorutaDiabetes)

control<-rfeControl(functions = rfFuncs, method = "cv", number=10)
rf.trainDiabetes <- rfe(healthTrain[,-11], healthTrain[,11],sizes=c(10, 15), rfeControl=control)
rf.trainDiabetes
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy  Kappa AccuracySD KappaSD Selected
##         10   0.8251 0.5074    0.02563 0.08916        *
##         15   0.8152 0.4213    0.02857 0.10381         
## 
## The top 5 variables (out of 10):
##    School, Age, Smoking, Angina, Leisure
rfDiabetesVars <- predictors(rf.trainDiabetes)

BorutaDiabetesVars
## [1] "Age"     "School"  "Angina"  "Stroke"  "Smoking" "Alcohol" "Fruits" 
## [8] "Leisure"
rfDiabetesVars
##  [1] "School"     "Age"        "Smoking"    "Angina"     "Leisure"   
##  [6] "Stroke"     "Sex"        "Race"       "Vegetables" "Fruits"
p ## Re-displaying the multinomial p-values for comparison with the Boruta and RFE selections
##   (Intercept)      Age2      Age3      Age4      Age5      Age6        Sex2
## 2   0.8300938 0.8365730 0.8715631 0.3617975 0.9252884 0.9371241 0.849863168
## 3   0.9161932 0.9354879 0.9285228 0.9118443 0.8978788 0.8858409 0.001200309
## 4   0.9610899 0.9520612 0.9464598 0.9338014 0.9302537 0.9168585 0.151028005
##        Race2      Race3     Race4     Race5        Race9     School3
## 2 0.01170138 0.94702951 0.9552676 0.9529034 0.0179512756 0.000000000
## 3 0.47262325 0.06353354 0.8998868 0.9876607 0.0344455047 0.090496934
## 4 0.94445321 0.00000000 0.9603446 0.9644091 0.0008808087 0.006183338
##       School4      School5      School6 Marriage2  Marriage3 Marriage4
## 2 0.991020355 0.9892317103 9.851238e-01 0.2396277 0.04173881 0.8924240
## 3 0.016369252 0.0006557862 1.605218e-05 0.7302569 0.99697170 0.9629989
## 4 0.002000961 0.0002484598 7.556029e-06 0.3576665 0.58010847 0.0000000
##    Marriage5 Marriage6 Employment2 Employment3 Employment4 Employment5
## 2 0.29718413 0.2590550   0.9911042   0.7491483   0.9585426   0.6834340
## 3 0.02160503 0.9108186   0.9963687   0.3789144   0.2287452   0.3691272
## 4 0.98248725 0.9440428   0.9490933   0.2843032   0.9682773   0.7030464
##   Employment6 Employment7 Employment8   Income2    Income3   Income4   Income5
## 2   0.1021366   0.4995564   0.1851681 0.5195646 0.02998857 0.8426481 0.3835821
## 3   0.4871631   0.7790623   0.1483092 0.9359924 0.35346933 0.4153050 0.5345033
## 4   0.9633968   0.3798384   0.4934006 0.5589069 0.41420468 0.6063436 0.2499176
##     Income9 HeartAttack2   Angina2    Stroke2     Smoking2  Smoking9
## 2 0.9306650    0.9267860 0.1821371 0.09587978 6.203503e-01 0.7180547
## 3 0.5498817    0.9697917 0.8205398 0.98518933 2.220446e-16 0.7902117
## 4 0.2294117    0.8619407 0.9408996 0.94655528 3.923081e-08 0.9536220
##       Alcohol2  Alcohol9      Fruits2    Fruits9  Vegetables2 Vegetables9
## 2 4.982134e-01 0.3153702 6.871306e-01 0.08208862 0.0734784985   0.1994067
## 3 6.564649e-07 0.2498937 2.233832e-06 0.49803016 0.0001578317   0.1647350
## 4 1.117346e-02 0.9541356 1.856812e-02 0.56349094 0.0803935932   0.9512140
##       Leisure2  Leisure9
## 2 7.075133e-02 0.4469572
## 3 5.453415e-13 0.4532258
## 4 1.707871e-02 0.8249915
diabetesOverlap <- c('School','Smoking','Fruits','Leisure') # For predicting diabetes, these are the features identified in common by all three feature selection methods (Alcohol was significant in the regression and confirmed by Boruta, but was not retained by RFE, whereas Fruits appears in all three lists).

6 Discussion

6.1 Conclusions

Each of the classification methods obtained high accuracy in predicting whether a given patient had one of the chronic conditions. However, the cohorts were unbalanced, with healthy patients far outnumbering those with a chronic condition, and future work should attempt to collect more balanced cohorts. The demographic and behavioral variables important for predicting angina include age, sex, smoking, alcohol, and leisure; for stroke, age and income; and for diabetes, education, smoking, fruit consumption, and leisure. These selections are based on the overlap among the three feature selection methods, but it may be reasonable to also include variables that overlap between only two of them. For example, smoking was identified as an important predictor of stroke by logistic regression and RFE, but not by the Boruta (random forest) algorithm. Future work would also benefit from including more behaviors, both protective and harmful with respect to the given conditions.
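Short of collecting a new cohort, the existing one could be rebalanced before model fitting, for example by downsampling the majority class. A base-R sketch on hypothetical toy data (the data frame `health_demo` and its class proportions are illustrative, not the project's actual cohort):

```r
set.seed(650)
# Hypothetical toy cohort: 95 healthy patients ("1") and 5 stroke patients ("2")
health_demo <- data.frame(Stroke = factor(c(rep("1", 95), rep("2", 5))))
n_min <- min(table(health_demo$Stroke))  # minority-class size
# Sample n_min row indices from each class, then keep only those rows
idx <- unlist(lapply(split(seq_len(nrow(health_demo)), health_demo$Stroke),
                     sample, size = n_min))
balanced <- health_demo[idx, , drop = FALSE]
table(balanced$Stroke)
## 
##  1  2 
##  5  5
```

The classifiers would then be retrained on `balanced`; Kappa (already reported by `rfe` above) is a more informative metric than raw accuracy when classes are skewed.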

6.2 Acknowledgements

Thank you to Dr. Dinov for providing the data-set.

6.3 References

Virani SS, Alonso A, Aparicio HJ, Benjamin EJ, Bittencourt MS, Callaway CW, et al. Heart disease and stroke statistics—2021 update: a report from the American Heart Association. Circulation. 2021;143:e254–e743.